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Explicit codes are often used to simulate the nonlinear dynamics of large-scale 
structural systems, even for low frequency response, because the storage and CPU re- 
quirements entailed by the repeated factorizations traditionally found in implicit codes 
rapidly overwhelm the available computing resources. With the advent of parallel pro- 
cessing, this trend is accelerating because explicit schemes are also easier to parallelize 
than implicit ones. However, the time step restriction imposed by the Courant stabil- 
ity condition on all explicit schemes cannot yet — and perhaps will never — be offset 
by the speed of parallel hardware. Therefore, it is essential to develop efficient and 
robust alternatives to direct methods that are also amenable to massively parallel pro- 
cessing because implicit codes using unconditionally stable time-integration algorithms 
are computationally more efficient when simulating low-frequency dynamics. Here we 
present a domain decomposition method for implicit schemes that requires significantly 
less storage than factorization algorithms, that is several times faster than other popular 
direct and iterative methods, that can be easily implemented on both shared and local 
memory parallel processors, and that is both computationally and communication-wise 
efficient. The proposed transient domain decomposition method is an extension of the 
method of Finite Element Tearing and Interconnecting (FETI) developed by Farhat 
and Roux for the solution of static problems. Serial and parallel performance results 
on the CRAY Y-MP/8 and the iPSC-860/128 systems are reported and analyzed for 
realistic structural dynamics problems. These results establish the superiority of the 
FETI method over both the serial/parallel conjugate gradient algorithm with diagonal 
scaling and the serial/parallel direct method, and contrast the computational power of 
the iPSC-860/128 parallel processor with that of the CRAY Y-MP/8 system. 
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1. Introduction 


Nonlinear transient finite element problems in structural mechanics are char- 
acterized by the semi-discrete equations of dynamic equilibrium: 

Mii + f* nt (u, p c , 9) = F T< ( 1 ) 


where M is the mass matrix, u is the vector of nodal displacements, a dot su- 
perscript indicates a time derivative, f n< is the vector of internal nodal forces, 
p c denotes a set of control parameters, 9 is a function of past history of the 
generalized deformation gradients, and F zt is the vector of external nodal forces. 
The solution of Eq. (1) via an implicit time-integration methodology generates a 
nonlinear algebraic system of equations at each time step. The Newton-Raphson 
method and its numerous variants collectively known as “Newton-like” methods 
are the most popular strategies for solving the resulting nonlinear problem. All 
of these algorithms require the solution of a linear algebraic system of equations 
of the form: 


K (u£,) Au‘‘+'> = r(u«) 


*n+l 

where 


( 2 ) 


Au 


,(*+!) = u (*+l) „(*) 


n+1 


n+1 


— U 


n+1 


and where the subscript n refers to the n-th time step, the superscript k refers 
to the k-th. nonlinear iteration within the current time step, K is a symmet- 
ric positive approximate tangent matrix that includes both mass and stiffness 
contributions, and Au^ 1 * and r(u ( n ^ 1 ) are respectively the vector of nodal dis- 
placement increments and the vector of out-of-balance nodal forces (dynamic 
residuals). Most if not all finite element production codes use a direct method — 
that is, a method based on a factorization scheme — for solving the linearized 
problem (2). However, for large-scale three-dimensional structural problems, di- 
rect solvers entail memory and CPU requirements that rapidly overwhelm the 
largest of the computing resources that are currently available. As a result, even 
low frequency dynamics problems are often solved with explicit codes in order 
to avoid the extreme demands placed on computer resources by the repeated 
factorizations found in implicit time-integration methodologies. This issue has 
been well addressed by Hughes, Ferencz and Hallquist [l] who have proposed the 
well celebrated Element-By-Element (EBE) Preconditioned Conjugate Gradient 
(PCG) algorithm (see also Hughes, Levit and Winget [2]) as a means to allevi- 
ate the difficulties associated with direct solution methods. Moreover, with the 
advent of parallel processing, the popularity of explicit schemes and iterative solu- 
tion strategies has been rapidly increasing (see, for example, Hajjar and Abel [3], 
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Farhat, Sobh and Park [4], Belytschko, Plaskacz, Kennedy and Greenwell [5], and 
BifHe [6]) because these methodologies (a) are easier to parallelize than implicit 
schemes and direct solvers, and (b) they usually induce short range interprocessor 
communications that are relatively inexpensive, while the factorization methods 
used in most implicit schemes induce long range interprocessor communications 
that often ruin the sought-after speed-up. However, the time step restriction 
imposed by the Courant stability condition on all explicit schemes cannot yet 
and perhaps will never — be offset by the speed of parallel hardware, and 
many iterative solvers often fail when applied to systems arising from the analy- 
sis of large-scale flexible structures. Therefore, it is essential to develop efficient 
and robust alternatives to direct methods that are also amenable to massively 
parallel processing, because unconditionally stable implicit schemes are compu- 
tationally more efficient than explicit time-integration methodologies at simu- 
lating low-frequency dynamics. The EBE/PCG algorithm developed by Hughes 
and his co-workers [1, 2] is such an alternative which additionally utilizes the 
structure inherent in finite element formulations and implementations. Here, we 
propose another alternative which is driven by substructuring concepts and which 
achieves a greater improvement over the CG algorithm with diagonal scaling than 
reported in [1] for the EBE/PCG scheme. However, unlike EBE methods, the 
proposed methodology requires the assembly and factorization of substructure 
level matrices and therefore requires more storage than the EBE/PCG solution 
algorithm. 

A good balance between direct and iterative solution algorithms is provided 
by Domain Decomposition (DD) methods which usually blend both of these so- 
lution strategies. A well designed DD method requires less storage than direct 
solvers and converges faster than other purely iterative algorithms when applied to 
highly ill-conditioned systems (see, for example, the collection of papers compiled 
in [7]). Moreover, in their simplest form, DD methods have an explicit character 
because they effectively share information only between neighboring subdomains, 
which makes them very attractive for parallel processing. The method of Finite 
Element Tearing and Interconnecting (FETI) is a DD algorithm based on a hybrid 
variational principle that was developed by Farhat and Roux [8] for the parallel 
solution of self-adjoint elliptic partial differential equations. It combines a direct 
solver for computing the incomplete subdomain displacement fields with a PCG 
algorithm for extracting the dual tractions at the subdomain interfaces. This 
piethod was shown to outperform direct solvers on both serial and coarse-grained 
multiprocessors such as the CRAY Y-MP/8 system, and to compare favorably 
with other DD algorithms on a 32 processor iPSC-2 (Farhat and Roux [9]). In this 
paper, we present an extension of the FETI methodology to structural dynamics 
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problems. In particular, we construct two subdomain-by-subdomain precondi- 
tioners for the interface problem which have direct mechanical interpretations. 
We mathematically discuss their computational properties and numerically as- 
sess their relative performances. We also report on some recent developments in 
the FETI process pertaining to global residual estimators, loss of orthogonality, 
and numerical scalability with respect to the mesh and subdomain sizes. Finally, 
we analyze some serial and parallel performance results obtained on the CRAY 
Y-MP/8 and the iPSC-860/128 parallel processors — which are representative 
of today’s two extreme parallel architectures — for all of the transient FETI 
method, a parallel active column direct solver, and a parallel implementation of 
the CG algorithm with diagonal scaling. These performance results show that all 
of the mentioned parallel algorithms exhibit a different type of bottleneck, but 
in most cases, the FETI method is shown to significantly outperform both the 
parallel CG algorithm with diagonal scaling and the parallel direct solver. 

2. A transient FETI methodology 

2.1. Tearing and interconnecting 

Let SI denote the volume of the structure to be analyzed and {SI 3 } 3 ^^' 
denote a partitioning (tearing) of SI into N s substructures or subdomains. We 
denote by T 3 and ry 9 , respectively, the portion of the boundary of SI 3 where 
surface tractions t 3 are prescribed, and the interface boundary between SI 3 and a 
neighboring SI 9 . A weak form of the equations that govern the dynamic undamped 
response of the global structure can be written in a subdomain-by-subdomain 
form as: 

/ (8u 9 p a ii a + 6£ 3 a a ) dSI = [ 6u a f a dS 1+ / 8u a i a dT 
JQ* Jft* JY* t 

Y I 6u a \ a ’ q dr (3) 

(s = 1,...,N 3 ) 

subject to the inter-substructure continuity constraints: 


u a = U q - u a = u 9 ; il a = il q on * ^(J ^{Sl'f)^} (4) 

3=1, g=l 


In Eqs. (3) and (4) above, the superscripts * and 9 refer to SI 3 and SI 9 , re- 
spectively, a dot indicates a derivative with respect to time, a a and £ a are the 
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stress and strain tensors, p 3 is the mass density, u is the displacement field, / is 
the forcing field, and X 3,q is a Lagrange multiplier function which represents the 
interface tractions that maintain equilibrium between XI s and a neighboring Xl q 
(interconnecting). The first of the inter-substructure continuity constraints (Eqs. 
(4)) can be dualized as: 


/ 6X 3 ’ q (u 3 — u q ) dT = 0 (5) 

' n n 9 

If the original mesh of the global structure does not contain incompatible ele- 
ments, the subdomains { $7 '’ } * Z j are guaranteed to have compatible interfaces 
since these subdomains are obtained by partitioning the global mesh into N s 
submeshes. Therefore, we assume in the sequel that discrete Lagrange multipli- 
ers are introduced at the subdomain interfaces to enforce the inter-substructure 
displacement continuity. The piece-wise continuous approximation of the La- 
grange multipliers \ 3 ’ q in Eq. (5) is treated in Farhat and Geradin [10] for static 
problems. Using a standard Galerkin procedure where the displacement field is 
approximated by suitable shape functions as: 


u 3 = Nu 8 


( 6 ) 


and linearizing the equations of dynamic equilibrium around u n+ i, Eqs. (3) and 
(5) are transformed into the following algebraic system: 


M Au n _|_ 1 + K Au n+1 

s=N a 

5 (* + i) 
n+1 


r n+l _ D A n+1 


s — 1, N s 


E B * Au 


= 0 


3=1 


(?) 

where the superscript T indicates a transpose, M s and K‘* are respectively the 
subdomain mass and tangent stiffness matrices, Au 8 ^ 11 and are respec- 
tively the subdomain vector of nodal displacement increments and the subdo- 
main vector of out-of-balance nodal forces, B a is a boolean matrix that describes 
how ft is connected to the global interface, and is the vector of Lagrange 

multiplier unknowns at iteration k + 1 and step n + 1. For linear elastostatic 
problems, Eqs. (7) above corresponds to the discretization of a saddle-point vari- 
ational principle whose mathematical and computational properties are analyzed 
in Farhat and Roux [8, 9], 
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2.2. Time integration 


The linearized subdomain equations of equilibrium (7) 


compact form as: 


M 0 
0 0 


Aii! 


(*+ 1) 


n+l 
;(*+!) 
Wl 

where 


+ 


B 


b t 


\ (^+1) 
A n+ 1 


M = block diagonal [M^ 

K = block diagonal [k <(1> 
“ = [B^ ... B< N >>] 

. Au 


B 

,(*+ 1 ) _ 

i+l 

.<*> 


can De rewritten m 


r r (*) i 


1 n+l 
0 


M (JV,) ] 

K t(N,) 1 


= [Au 


n+l 


n+l 


= [ 


i(*) 

r n+l 


P N. W 

n+l 


1V,<* +1 > 1 T 
n+l J 
T 


( 8 ) 


In a companion paper [11], we show that the constraint equation B Au^ 1 * = 0 
introduces a destabilizing effect in the system that can be analyzed by inves- 
tigating the behavior of the time-integration algorithm at infinity. In par- 
ticular, we prove that the Newmark integrator without numerical dissipation 
(P = 1/4, 7 = 1/2) presents a weak instability which is excited for any time 
step value. We also show that the Newmark-/? damping scheme stabilizes the 
integration but at the cost of reducing the accuracy to only first order. Finally, 
we describe in [11] three implicit time-integration algorithms for integrating Eqs. 
(8) that are unconditionally stable and second-order accurate. These algorithms 
axe: 

(Al). A substructure version of the Hilber-Hughes-Taylor [12] algorithm with 
o: < 0. The numerical dissipation of this second-order accurate algo- 
rithm is shown to cure the weak instability caused by the constraints. 

(A2). A substructure version of the Newmark integrator without numerical 
dissipation (/? = 1/4, 7 = 1/2) but with a constraint on the substructure 
acceleration increments B Aii[j^.j — 0, rather than the substructure 
displacement increments. In particular, we show in [11] that the latter 
constraint does not cause the inter-substructure displacement constraint 
to drift away. 

(A3). A substructure version of the Newmark integrator without numerical 
dissipation (/? = 1/4, 7 = 1/2) and with a constraint on the substruc- 
ture displacement increments B Au^, 1 * = 0, but written in terms 
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of the displacement field increment Au^ 1 * and the momentum incre- 


ment Av 


( * +1) = M Aii (fc+1) 


n-fl 


■* n +i . The latter algorithm possesses desirable 
features for reducing round-off error propagation. 

Here, we select the time-integration algorithm (A3) and summarize it in 
order to keep this paper self-contained. Let denote the momentum in- 

^ T 2 

crement at iteration k + 1 and at the midpoint between steps n and n + 1: 


Av‘* + .‘ ) = M 

n-fj 


( 9 ) 


The midpoint subdomain increments Au^j and Av^i * axe integrated as 


follows: 


Au , i 


A 5 (fc + 1) 

AvLx 


A t . jk+ 1) 

= — Au 3 
2 


"+5 


At 

~2 


— A . s (*+i) 


( 10 ) 


where At is the time step. Substituting Eqs. (10) into the dynamic equations of 
subdomain equilibrium (7) written at the midpoint step n + \ leads to: 


AV n+! 


2 _ Ka . „(fc+i) 

— M 3 Au 3 x 

At n+ 2 


( 11 ) 


and 


.(*+!) 


(M 3 + ^ )a<; 


At 2 


4 

s=N e 


E B-Au^r 


4 «V B 




n+ 


i’) s = 


( 12 ) 


S =1 


U 




.(*) 


(*+i) 


U “ ' i + AU n+± 


n+, 


After Eqs. (12) are repeatedly solved for all nonlinear increments, n 3 t is found 

n +i 

and the time derivative of each subdomain momentum can be obtained via back- 
substitution in Eq. (1) as: 


f n+± 




« +i ,Pc,0) 


(13) 


Finally, the subdomain solutions are advanced as follows: 


u 


»+l = 2< + , + < 


'n-f 1 


= V 




+ 


At . 


(14) 


»+* 
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REMARK 2.2.1. Since the solution of Eqs. (12) will contain round-off 

3=N 

errors, we implicitly invoke the equilibrium condition ^B 5T A (H 1 1) = 0 and 

3=1 n+ 2 

directly compute the momentum increment in assembled form as: 


v„+i = (15) 

Therefore, our actual code utilizes Eq. (15) rather than Eq. (13). 

The above subdomain-by-subdomain time-integration algorithm is second- 
order accurate and unconditionally stable (see [11] for a proof). Its computational 
cost is dominated by the solution of Eqs. (12) which can be reduced to the 
following interface problem: 


S —N. A ±2 a=N, . 2 

(Eb*(M- + ^-K'-)-‘B“ T )A« +1) = £ + (16) 

S=1 3=1 4 2 


S=N. 


For large problems and fine mesh decompositions, it is not feasible to explicitly 
assemble the interface operator: 


F; = + (17) 

3=1 3 4 

whose size is equal to the total number of degrees of freedom on the subdomain 
interfaces. This implies that a direct method is not attractive to solve Eq. (16). 
The only efficient numerical method for solving Eq. (16) is that of conjugate 
gradients because once {(M + ^ = f'”’ have been factored in parallel, 

matrix- vector products of the form F can be efficiently performed using only 
parallel subdomain-by-subdomain forward and backward substitutions. 

The remainder of this paper focuses on the parallel solution of Eq. (16) via 
the PCG algorithm. 


3. Spectral properties of the interface problem 

s=N, ,-t T 

The interface matrix B K B corresponds to the discretization of 

3=1 

a compact operator which maps the normal components of the stress tensor along 
the subdomain interfaces, onto the traces on these interfaces of the corresponding 
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subdomain displacement fields. Recently, Roux [13] has shown that because of 

sz zJ^‘ ,. _1 T 

this compactness, the eigenvalues of the matrix 2] B K B are well sep- 

s= 1 

arated and have a higher density towards the low end of their spectrum. For a 

<9 — N s 2 g T 

fixed At, we can show that ^ B a (M 5 + ^-K* ) 1 B* has similar eigen prop- 

5=1 

erties. The objectives of this section are: (a) to numerically illustrate the spectral 

3 — N g 2 s T 

properties of the transient interface operator ^ B*(M 5 + ) _1 B S , and 

5=1 

(b) to highlight the impact of these spectral properties on the convergence rate 
of the CG algorithm. 

Consider the three-dimensional two-subdomain cantilever problem depicted 
in FIG. 1. The beam has a square cross section and a 10/1 length/height aspect 
ratio. Two finite element meshes are constructed using 8-node brick elements. 
The first mesh, Mi, contains 2400 internal degrees of freedom (d.o.f.) and 192 
interface d.o.f. The second mesh, M 2 , is finer: it contains 16320 internal d.o.f. 
and 672 interface d.o.f. 



FIG. 1 A three-dimensional cantilever problem 
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The eigenvalues of the transient interface operator F / (17) associated with the 
above problem are computed using a consistent mass formulation and a time step 
equal to one tenth of the sixth period of the structure. The distribution of these 
eigenvalues is depicted in FIG. 2 for mesh M\, and in FIG. 3 for mesh M 2 ■ 




FIG. 3 Spectral density of F j for mesh M 2 
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The x axis of the bar diagrams depicted in FIG. 2-3 represents the eigenvalues 
scaled by the smallest eigenvalue, and the y axis the number of eigenvalues per 
interval. For both meshes, F 7 is shown to have a few large eigenvalues that 
are well separated from the small ones. Figure 3 suggests that this separation 
amplifies when the mesh size h ► 0. Indeed, one can mathematically prove that 
the large eigenvalues of F 7 stabilize when the mesh size decreases, while its small 
eigenvalues accumulate towards zero. Essentially, this is because F 7 involves 
the inverses of the pencils (M s , K* ) and therefore its high modes correspond to 
physical modes while its low modes correspond to mesh modes. 

The spectral distribution of the interface problem associated with the FETI 
methodology has important consequences on the rate of convergence of the CG 
algorithm. During the first iterations, the conjugate gradient algorithm mostly 
captures the eigenvectors associated with the large eigenvalues. Since the high 
numerical modes of F 7 correspond to the low physical modes of the structure and 
since F j has only a few relatively high eigenvalues, the CG algorithm applied to 
the solution of Eq. (16) quickly gives a good approximation of the displacement 
increments. Intuitively, one can imagine that after the few relatively high modes 

of the interface operator are captured, the “effective” condition number of F 7 
that is the ratio of its largest uncaptured eigenvalue over its smallest one — 
becomes significantly smaller than the original condition number of F 7 , which 
accelerates the convergence of the CG algorithm. A detailed analysis of this 
superconvergence behavior of the CG algorithm in the presence of well separated 
eigenvalues can be found in the work of van der Luis and van der Vorst [14], 

The impact of the spectral distribution of the transient interface operator F 7 
on the convergence rate of the CG algorithm is highlighted in TABLE 1 which 
reports the number of iterations to achieve convergence for the FETI methodol- 
ogy described in this paper and for the classical Schur complement method. The 
transient interface operator resulting from the latter DD method (static conden- 
sation) is denoted here by S 7 . While both F j and S 7 are shown to have identical 
two- norm condition numbers (/^(Fj) = ^ 2 (S 7 )), the CG algorithm applied to F 7 

is shown to converge twice as fast as when applied to S 7 . This clearly demon- 
strates that conditioning is not always the only factor governing the convergence 
rate of the CG algorithm. 


11 



TABLE 1 

Three-dimensional two-subdomain cantilever problem 


Mesh Mi : 2400 internal d.o.f — 192 interface d.o.f. 
Mesh M 2 : 16320 internal d.o.f — 672 interface d.o.f. 


Convergence criterion: 


~t ~t 

mesh « 2 (F/) « 2 (S j) # of iterations # of iterations 

(CG-FETI) (CG-Schur) 

Mi 58.0 58.0 14 25 

M 2 55.2 55.2 14 33 


Another important consequence of the spectral distribution of F, is that, for 
a fixed mesh partition, the rate of convergence of CG applied to the solution of 

Eq. (17) is weakly dependent on h, even though the condition number of Fj varies 

as 0(l/h). This is because once the few h-dependent high eigenvalues of F^ have 
been captured, the superconvergence phenomenon “kicks-in” independently of h. 
This is clearly demonstrated in TABLE 1 where the CG-FETI algorithm is shown 
to converge with the same number of iterations for both the coarse mesh Mi and 
the fine mesh M 2 . Similar results are reported in TABLE 2 for a four-subdomain 
bending plate problem. The plate is clamped at one end and subjected to a 
uniform pressure (FIG. 4). It is discretized with 4-node shell elements and with a 
consistent mass formulation. The time step is set to one tenth of the sixth period 
of the structure. 
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FIG. 4 A four- subdomain bending plate problem 


TABLE 2 

Four-subdomain bending plate problem 


Discretization: N x N where N = j 
Algorithm: CG-FETI 

Convergence criterion: HK'iQ nu^,"-r(u<»,)||, 
6 


< 10 


-4 


h 

# of d.o.f. 

# of interface d.o.f. 

# of iterations 

1 

10 

605 

55 

43 

1 

20 

2205 

105 

54 

1 

40 

8405 

205 

74 

1 

80 

32805 

405 

75 

1 

160 

129605 

805 

75 




All of the above results confirm that for all practical purposes and for a fixed 
mesh partition, the rate of convergence of the CG algorithm applied to the so- 
lution of the interface problem associated with the transient FETI methodology 
can be considered independent of the mesh size h. We refer to this important 
property as the numerical h scalability of the FETI methodology, where “numer- 
ical” is used to differentiate from the well-established parallel scalability of the 
proposed computational method, and the “h” is used to remind the reader that 
this result is valid for a fixed mesh partition. 

4. Preconditioning with a trace operator 

4-1. Sum of the projections of the inverses 

~t a=N, 

Because the interface operator Fj = ^ B 3 (M 9 + ) *B 3 essen- 

3=1 

tially assembles the projections on the subdomain interfaces of the inverses of the 
subdomain transient operators (M 3 + ^-K**), it is attractive to consider the 
following subdomain-by-subdomain parallel preconditioner: 

( -i s=N, A / 2 

T j = J2 B-(M* + ^-K |, )B' r (18) 

3=1 

which essentially assembles the projections on the subdomain interfaces of the 
independent subdomain transient operators (M 3 + ) themselves. How- 

ever, we have found that for many static problems the static equivalent of the 
above preconditioner did not much improve the condition number of the interface 
problem (Farhat and Roux [8]), and in some cases, it slightly degraded it (Roux 
[13]). It has also been suggested to us (Mandel [15]) that the condition num- 

~t _1 ~t 

ber of Tj F j still varies as 0(1/ h), which should not make Tj an attractive 
preconditioner. Yet, we have recently reported on an extensive set of numerical 

results for realistic structural problems where the static equivalent of T t was 
shown to significantly improve the convergence rate of the CG algorithm for static 
problems (see for example Farhat and Roux [8, 9, 16] and Farhat, Felippa and 
Militello [17]). All of these observations can be reconciled within the theory of 
“superconvergence” discussed in Section 3. For this purpose, we consider again 
the two-subdomain cantilever problem depicted in FIG. 1. The distribution of the 

eigenvalues of its Tj Fj operator computed using a consistent mass formulation 
and a time step equal to one tenth of the sixth period of the structure is depicted 
in FIG. 5 for mesh M 2 . 
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By comparing FIG. 3 and FIG. 5, the reader can observe that essentially 

amplifies the separation between the clusters of small and large eigenvalues of 
the interface problem, which accelerates the convergence of the CG algorithm. 
Following the reasoning outlined in Section 3, the reader can also conclude from 
FIG. 3 and FIG. 5 that after the few large eigenvalues of the interface problem 
are captured, the “effective” condition number of T/ Fj is much smaller than 

that of Fj. This “superconvergence” behavior is highlighted in TABLE 3 below 
for the three-dimensional two-subdomain cantilever problem. 
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TABLE 3 


Three-dimensional two-subdomain cantilever problem 

Mesh Mi : 2400 internal d.o.f — 192 interface d.o.f. 
Mesh M 2 : 16320 internal d.o.f — 672 interface d.o.f. 


Algorithm: CG-FETI 


Convergence criterion: 


IlKdi^) AU^ 1 > -r( u<;» , ) 1 1 2 


ll r (U ( „*Ji)lla 


< 10 


-4 


mesh # of iterations (F 7 ) # of iterations (T 7 F 7 ) 

Mi 14 6 

M 2 14 6 


4-2 Mechanical interpretation and parallelism 

At each step q of the PCG algorithm, the preconditioning phase using T 7 
can be written as: 


Solve T 7 z ? = r ? 
where 


s=N, 


f q 


= E 

3 = 1 
S=N . 


~s q 

r 


(19) 


a +2 

= y . B '( M> + - b-X+V’') 

3 = 1 


and the solution of Eq. (19) can be written as: 

3 ~Ns a +2 

z ? = + (20) 

3 = 1 4 

The q - th residual r g corresponds to the jump of the displacement increments 
across the subdomain interfaces (see Section 7.1.). Clearly, the above precon- 
ditioner is intrinsically parallel as it involves only subdomain-by-subdomain in- 
dependent computations. Moreover, it is economical because it only requires 
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matrix-vector products of sizes equal to the subdomain interfaces. From a me- 
chanical viewpoint, solving Eq. (19) corresponds to finding a set of “equivalent” 
interface forces that can reproduce the imposed jump of the displacement incre- 
ments. Therefore, the problem described by Eq. (20) is indeed an approximate 
inverse of the interface problem. 


5. Preconditionig with a dual operator 

5.1 Sum of the exact inverses 

A better approximation of F j can be obtained by approximating the inverse 
of the sum by the sum of the exact inverses — that is, by assuming that: 


s=N a 


[£ B'(M> + ^V)- 1 B- r r* M + (21) 

* = 1 4 3=1 4 


s=N. 


which leads to the following preconditioner: 


3=N. 


d; = E[ B, (M* + ^V)- i B* T r i 


(22) 


3=1 


If the subdomain mass and stiffness matrices and the subdomain displacement 
field are partitioned as: 


II 

s 

M 3 - 

m* 6 ‘ 

, K 3 = 

*5 
1 



[m 3 6 

M s bb . 


l K ib 

k :,j 


and u 3 = 


u 


L u 6J 


where the subscripts , and b respectively refer to the internal and interface bound- 
ary degrees of freedom, the inverse of ) can be written as the solution 

of the partitioned matrix equations: 


M& + 

M 

which yields: 


A t 2 T/3 

ii r 4 

il 


M’ 6 + a tb 


A k 

Af 6 


AS 

Af 


I O 
O I 


(23) 


A? s = lMl i + ^KU-(M,U~K’ b ) T (M 

It follows that: 


At 2 


A/ 2 

(24) 


[B 3 (M 3 + — K*')-^^ 1-1 


1 = [[O 11 

A’. A s 1 

A ii A ib 

. T® 

o 

L 1 J 

A . A 3 

A ib A bb J 

I 


1-1 


= A 


bb 


(25) 
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which implies that: 


[B a (M s + 

= (Mf, + 


At 2 


-l 


K 9 


At 2 . 


At 2 . 


») - (M?, + — K* s f(M', + + 


Ai ! 


K*i) 

(26) 


The right hand side of Eq. (26) above is the Schur Complement matrix associated 
with the subdomain transient operator (M 5 -f ^pK 4 *). Therefore, Eq. (26) 
demonstrates that the preconditioning operator: 



s=N t 


E(m 


3=1 


At 2 


At 2 


At 2 . 




At 2 


K?*) 


(27) 


and the interface operator F 7 


s=N, 

£ B*(M S 


+ ^-K* ) *B S are dual oper- 




3=1 

ators. Based on this duality and on the mathematical framework presented in 
Bjordstad and Widlund [19], it can be shown that away from crosspoints — that 


is, points where more than two subdomains intersect, the preconditioner D f 

is such that the condition number of D 7 F 7 is independent of the mesh size h. 
Moreover, D 7 is also an intrinsically parallel preconditioner since it involves 
only subdomain-by-subdomain independent computations. 


5.2 Mechanical interpretation and computational issues 

The solution of the interface problem (16) via the PCG algorithm (here with 
the dual preconditioner) requires performing at each iteration q the following 
computations: 


Multiply 


P 6 = 


F/P l 


■ — ' l 

Solve D 7 z 


? - i ^9 


(main loop of CG ) 
(preconditioning step) 


(28) 


Here, p q b denotes the search direction computed during the g-th iteration, and 
r q b denotes the q - th residual which represents the jump of the displacement in- 
crements across the subdomain interfaces. All computations summarized in Eqs. 
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(28) can be performed at the subdomain level as follows: 


Pi 


Z h = 


B*(M* + ^K , ')- 1 B*’’pf 
[(MJ S + ~KU 


( 29 ) 


- (M’, + (M* + ^!k' + ^K' k )]fJ 


The first of Eqs. (29) can be re-formulated as a problem with Neumann boundary 
conditions (indicated between braces {}): 


M' + £fK’ {i m;, + 


l M » + ^r K ‘i + 


Ar 

4 

At 2 


K 3 


ib 

'3 

L bb 



r - 3 q i 

p? 


1 

o 


Lp6 J 




(30) 


The solution of this problem represents the trace on the interface boundary 
of fl s of the displacement field resulting from prescribing the interface traction 
forces . 

The second of Eqs. (29) can be re-formulated as a problem with Dirichlet 
boundary conditions (indicated between braces {}): 


■ M’ ib + 1 [ zf ] 

+ M^ + ^Kj [{if }. 

and which can be solved in two steps: 



(31) 


Solve 


Evaluate 


(m;, + ~k;,)z‘' = 

At 2 . 


At 2 

(M* 6 + ^K* 6 )rf 

At 2 . 


(32) 


< = (M? 6 + K® 6 )zf -f (Mjj + -^—K£ 6 )r a 


-s q 

b 


Therefore, the preconditioning step reformulated in Eq. (31) corresponds to find- 
ing in each subdomain Q. 3 the traction forces that are needed to prescribe the 
interface boundary displacement increment jump fj*. 

The ^mechanical interpretation of the CG algorithm with the dual precondi- 
tioner D f is straightforward. Within each iteration, a Neumann problem is first 
solved: traction forces are applied at the subdomain interfaces, and a jump in 
the displacement increments is computed. Next, this computed jump is imposed 
at the subdomain interfaces and new interface traction forces are computed. The 
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successive iterations drive this jump to zero and upon convergence the computed 
Lagrange multipliers (traction forces) enforce the inter-subdomain displacement 
continuity. 


On the computational side, it should be noted that even though the block 
i) governing Eq. (30) is embedded in the system governing Eq. (32), 
this block is stored and factored twice. The reason is that Eqs. (30) and (32) 
have two different optimal numberings for the internal degrees of freedom that 
lead to two different matrices. For a small number of subdomains, this drawback 
becomes a severe issue as the cost of storing and factoring (M^ + ^-K*-) is 
important. For finer mesh partitions, this additional storage and computational 
burden becomes less significant as the size of every subdomain and the cost of 
factoring (M?- + ^-K^) becomes negligible when compared to the size of the 
global interface problem and the cost of the main CG iterations, respectively. In 
particular, if the tangent operator is not reconstructed at each nonlinear iteration, 
the computational overhead introduced by the dual preconditioner is acceptable. 
However, it should also be noted that an iteration with the dual preconditioner 
is more expensive than an iteration with the trace preconditioner as the former 
one involves in every subdomain a pair of forward/backward substitutions of sizes 
equal to the number of internal subdomain degrees of freedom, while the latter 
one involves in every subdomain a matrix-vector product of a size equal to the 
number of subdomain interface degrees of freedom. 


REMARK 5.2.1. Global and/or local symmetry effects such as mesh and material 
symmetry, but not necessarily load symmetry , can accelerate the convergence of 
the FETI method with the dual preconditioner. For example, in the case of a 

~t _l ~t -1 

two-subdomain problem, symmetry implies that D j = 4 F ; , and therefore 
the PCG algorithm converges in one iteration. 


6. Loss of orthogonality 

Matrices with the spectral pattern discussed in Section 3 have been shown 
to cause a rapid loss of orthogonality of the direction vectors (Parlett [18]). For 
these matrices, increasing the numerical precision does not restore the orthogo- 
nality of the search vectors because in this case the propagation of the errors is a 
polynomial function of the ratio between consecutive eigenvalues. 

The loss of orthogonality of the search directions during the solution of the 
interface problem (16) has a disastrous consequence on the rate of convergence of 
the PCG algorithm. To remedy this problem, we introduce a reorthogonalization 
procedure within the PCG algorithm which however, in order to determine the 
new direction vector, entails at each iteration q the following additional burden: 
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(a) the storage of the direction vector p 9 and the product F^p 9 . 

(b) the evaluation of q dot products of the form rJ[F^p 9 ] where [F jp 9 ] is readily 
available and 1 < j < q, and of an nj x j matrix-vector product where n/ is 
the number of interface unknowns. 

Clearly, such a reorthogonalization procedure is not feasible if introduced 
during the solution via the PCG algorithm of a global finite element problem, as 
it would require unreasonable amounts of memory and CPU. However, it is quite 
affordable within the context of a domain decomposition algorithm as it applies 
only to the interface problem. In particular, the reader should note that the 
additional computational costs outlined in (b) axe small compared to the cost of 
the pair of forward and backward substitutions that are required at each iteration 
q of the PCG algorithm in order to evaluate the product F^p 9 . 

The efficiency of the reorthogonalization procedure discussed above is high- 
lighted in TABLE 4 for the four-subdomain bending plate problem introduced in 
Section 3. 


TABLE 4 

Four -subdomain bending plate problem 


Algorithm: PCG-FETI 


s=N. 


3=1 


Preconditioner: T/ = B a (M a + ^pK*’)B 


n . ||K ) Au ( *+ 1 ) -r(u (fc ;’ )|| 2 

Convergence criterion: ^ — n +* ; 

llrdOlL 

Multiprocessor: iPSC-860 (4 processors) 


< 10 


-3 


h 

# of iterations 

# of iterations 

CPU 

CPU 


(without reorth.) 

(with reorth.) 

(without reorth.) 

(with reorth.) 

1 

20 

42 

22 

5.5 s. 

3.7 s. 

1 

40 

66 

24 

33.2 s. 

16.5 s. 


REMARK 6.1. In practice, the number of direction vectors that are stored for 
reorthogonalization is determined by the memory space that is available after 
all of the other storage requirements of the transient FETI method have been 
satisfied. When only a few directions can be stored, a partial reorthogonalization 
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is implemented. In this case, the optimal strategy consists in storing the first few 
directions instead of the most recent ones, because the subspace generated by the 
first directions is closer to the subspace associated with the highest eigenvalues. 

At this point, we mention that for many of the realistic structural problems 
that we have solved with the FETI method, we have observed that reorthogonal- 
ization was necessary for convergence. All of the numerical results presented in 
the remainder of this paper were obtained with either partial or full reorthogo- 
nalization. 


7. Global residual and numerical precision 

7.1. Interface v.3. global convergence 

At every iteration q of the PCG algorithm applied to the solution of the 
interface problem (16), the computed and Au 3 ^ 09 verify: 


(M 3 + 


At 2 


K < ‘)Au 


3 (*+i)0 
n+ ^ 


At 2 


(r a(k) -R 3 T A (fc+1), i 

^ P »+i B A n+I ) 


s = 1,...,N 3 (33) 


so that: 


Au 


n+ -5 


At 2 




(m s + T r ) _lr S - + t^aIS 


At 2 


At 2 . 


”t a \ — 1 id 3 T \ (fc+l) ? 




(34) 

Using (34), the jump in the displacement increments at the subdomain interfaces 
can be written as: 


s= N, 

E 

3 = 1 


n+i 


-GA + L 


where 


G 


L 


B 3 (M S + ^K ( ’) -1 B sT 


3=N s 


3=1 

s=N a 


tEb-(M'4k')-'<? s 


3=1 


(35) 


From (16) and (35), it is clear that the gradient of the interface problem (16) 
is indeed the jump in the displacement increments at the subdomain interfaces. 
Therefore, the residual of the interface problem || - GA + L|| gives the order 
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of magnitude of the error in ||Au n _|_i || and not the order of magnitude of the 
error in the global residual ||K*(u»+i) Au^ 1} - r(u£ l fc J 1 )||, where K = 

M + ^-K ( . Moreover, since the condition number of K* varies as 0(l//i 2 ) while 
the condition number of Fj varies as 0(l//i), the convergence criterion: 

||-GA ? +L|| < e II - GA° + L|| (36) 

does not guarantee that: 


IIK'tuS,) Aui‘f 




^Ve have observed that for most structural problems, the relative global residual 
(37) is typically 10 2 to 10 3 larger than the relative interface residual (36), which 
clearly indicates that the convergence of the FETI methodology should be based 
on the global criterion (37). 

Unfortunately, evaluating (37) at every PCG iteration ([ requires perform- 
ing f a global matrix- vector multiply in order to compute the global residual 

11^- ( u n+i) ^ u n+ i — r ( u n+i)ll- This would double both the computational 
and communication costs of a PCG iteration Therefore, we had to develop an es- 
timator for the global residual that is accurate and computationally economical. 

Using Eq. (12) and the partitioning introduced in Section 5.1, the restriction 
of the global residual to every subdomain fi 3 can be written as: 


K (u: 


) Au s 


(*+i)9 


-K<+i) = 


n-f 1 > 


n+i 


(M?, + ^k;,) (au 


(MJi + ^-K U) (Au^r: - Au 


a(*+i) ’ 
n+ 5 66 

F 9 

66 


- Au s 


(* + !)« 

n +I 66 
a (* + i)9 


66^ J 
(38) 

, A s(* + i) 9 A ,(*+i)9 ... 

wnere Au ^ — uu , is the jump in the displacement increments at the 

2 UO 1 2 00 

subdomain boundary interface. Clearly, the reaction forces (M- 6 +— K- 6 ) (Au s( ‘) ,)? - 

n+ ’ 66 


A s (t + 1 > 9 \ 1 , 

^ U n+! bb ) 3X6 zero except on those degrees of freedom which are connected to the 
interface ones. Moreover, equilibrium suggests that ||(M- 6 +^K 3 6 ) (Au 3( *t l)9 - 

^ u n+i 66 )ll an< * II(Mm+^-Kj 6 ) ( A 1 ' 66 — A u ^ ( + i ) ^)|| are of the same mag- 
nitude order, so that one can reasonably approximate IlK^u 3 ^) - r(u 3 ' 


(*) 

i+i- 
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as follows: 


II"- ( U n+l) A U „+1 r ( U n+l ; 


leaf 


At 2 


= li( M » + =-K‘„) (Au^f’.. - A<‘" 


(* + !)« 


2 66 


n + 


2 66 


= ||B*(M* + ^-K-)B’ r (Au£«>’ ( - A< , ;;”’ 1 )|| 

(39) 


From (35) and (19) it follows that: 


* .(* + I) 9 A + 

Au ■* -Au* + , = — 

At 2 




B S (M S + - B sT \ (k + 1}9 ) 

d U +2 7 


(40) 

In assembled form, the above estimator becomes: 
l|K‘(u«,) Au^ 1 ' - rfuS,)!!" 1 

= ~ £ B*(M* + ^K*)B* r (M’ + ^K‘')-'(r;'*i - B*A 


At 2 


3=1 

*<* 


which in view of Eqs. (19-20) can be written as: 


IIK'fuWjAu^ 1 )' - r(u [%, 


\est 


At 2 


z 9 = 


At 2 


(41) 


T 7 r 9 (42) 


Therefore, we replace the convergence criterion (37) with: 

z 9 < ez° 

— that is: 


Tj r 9 < e T] r° 


(43) 


Clearly, if the trace preconditioner is used, the proposed estimator for the global 
residual does not require any additional computation. 
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n+$ > 


u 



The discrepancy between the interface and global residuals and the accuracy 
of the proposed global residual estimator are demonstrated in FIG. 6 for the four 
subdomain bending plate problem with h = 1/40 (Section 3). 



7.2. Setting the tolerance for the PCG algorithm 

Let d n+1 , u£ +1 , and (u* +1 )°° denote respectively the exact solution of the 
nonlinear problem (1) at time step n + 1, the exact solution of the linearized 
problem at the A;-th nonlinear iteration of time step n+ 1, and the PCG solution 
of the linearized problem at the Ar- th nonlinear iteration of time step n + 1. Our 
stopping criterion for the PCG algorithm is: 


||K‘(u«) Au£f - r(u<‘> 1 )|| < ellK^SOAu^y 0 - r(„<‘» 1 )|| (44) 


where q denotes the PCG iteration number. It follows that: 


!u * +1 )°° - u fc+1 < 

t u n+l/ u n+l ^ 


u n+l - d„ + i 


(45) 
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On the other hand, a “Newton-like” method implies that: 

ll“S« - <Wi|| < c ||u* +1 - d„ +1 ||'> (46) 

where C is a real constant, and p is an integer describing the rate of convergence 
of the “Newton-like” method. For example, p = 2 for the Newton- Raphson algo- 
rithm, and p = — ^ = 1.618 for the Secant method [20]. From the triangular 

inequality for norms and from (45-46) we deduce: 

ll( u S 1 i) 0O - d »+i|l < « ll«S+i -d„+i|| + c ||u* +1 -d B+1 ||* (47) 

Therefore, if we desire that ||(u^)°° - d n+1 || < e, it suffices to take e = ep . 
For p = 2 (Newton-Raphson) this leads to e = y/i, and for p = 1.618 (Secant), 
this leads to e = e 0,618 . 

In the sequel, we fix e = 10 -6 . Therefore, we choose e = 1CT 4 , which 
accommodates both the Newton-Raphson and Secant methods. 

8. Serial performance analysis and numerical H non-scalability 

8.1 Nonlinear transient analysis of a space antenna connector 

As a first large-scale example, we consider the nonlinear transient analy- 
sis of a mechanical joint designed for connecting three space antennaes. The 
corresponding finite element (FIG. 7) mesh contains 11284 nodes, 21888 3-node 
shell elements with 6 d.o.f. per node, and a total of 67704 d.o.f. The time 
step At is chosen as T 3 /15 = 0.0123s., where T 3 denotes the third fundamental 
period of the structure. After the nodes are renumbered with the Reverse Cuthill- 
McKee (RCM) scheme [21], the average column height of the skyline of the un- 
decomposed matrix (M + ^K*) is 616 and its envelope size is 41,674,976. All 
computations discussed in this section axe carried out on a CRAY Y-MP proces- 
sor, and all performance results are reported for the solution of the displacement 
increments (Eqs. (12)) in one nonlinear step. 
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FIG. 7 Finite element discretization of a space antenna connector 

(from a finer mesh) 


8.2 Reference serial performance results 

First, performance results are reported for the direct method ( LDL T factor- 
ization) and the CG algorithm with diagonal scaling (Jacobi-PCG) applied to the 
un-decomposed problem (TABLE 5). These results will later serve as reference 
serial performance results. Memory consumption is measured in millions of 64 
bit words (MW). MFLOPS (Million FLoating-point Operations per Second) are 
reported in order to distinguish and independently assess the numerical and im- 
plementational performances. A sparse data structure is used for the Jacobi-PCG 
algorithm. 
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TABLE 5 

Space antenna connector - Reference serial performance results 


67704 equations - At = = 0.0123s. 

Convergence criterion for Jacobi-PCG: ^ AU "+ 1 ~ r ^ u n+i)ll2 < iq-4 

ll r (™»+.>H* 

CRAY Y-MP (single processor) 

algorithm memory # of itera tions MFLOPS CPU 

direct ( LDL T ) 42 MW 1 220 253.0 s. 

Jacobi-PCG 4.2 MW 3320 80 345.6 s. 


It is interesting to note that for the above problem, the direct method on the 
CRAY Y-MP outperforms the Jacobi-PCG algorithm essentially because it vec- 
torizes better. The MFLOP rate of LDL T is 2.75 times better than that of 
Jacobi-PCG, but its solution time is only 1.36 times faster. 

8.3 FETI performance results 

Next, we report on the performance results of the FETI method with the 
trace (TABLE 6) and dual (TABLE 7) preconditioners, for various numbers of 
subdomains. The growth of the interface problem with the number of subdomains 
is depicted in FIG. 8. All local subdomain problems are solved using the same 
implementation of the LDL ^ algorithm as in the un-decomposed case (TABLE 
5). FAC designates the computational phase where the subdomain operators are 
factored. All mesh partitions are obtained using the mesh decomposer described 
in Farhat [22]. 
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67704 equations - At = = 0.0123s. 


15 

1 9=N a 

Preconditioner: Tj = ^ B 3 (M S + 

5=1 


Convergence criterion: — AU , n t 1 — < io -4 

l|r(u^ 1 )|| a 


CRAY Y-MP (single processor) 



Ns 

memory 

CPU 

FAC 

MFLOPS 

FAC 

# of itr. 

CPU /itr. 
PCG 

CPU 

PCG 

MFLOPS 

PCG 

CPU 

TOT 


4 

18 MW 

19.0 s. 

165 

159 

0.43 s. 

68.4 s. 

130 

87.4 s. 


8 

16 MW 

13.6 s. 

150 

200 

0.39 s. 

78.0 s. 

110 

91.6 s. 


16 

12 MW 

9.6 s. 

130 

267 

0.36 s. 

96.1 s. 

105 

105.7 s. 


32 

10 MW 

6.7 s. 

105 

456 

0.34 s. 

155.0 s. 

90 

161.7 s. 


64 

9 MW 

5.0 s. 

85 

717 

0.33 s. 

236.6 s. 

75 

241.6 s. 
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TABLE 7 

~t -1 

Space antenna connector - FETI(Dj ) serial performance results 
67704 equations - At = = 0.0123s. 

~t _1 s=N, 

Preconditioner: Dj = ^ [B S (M* + ^-K* ) _1 B 3 ] -1 

3=1 

Convergence criterion: ^±1 its ^ ^ i n-4 

ll r (u ( „+ 1 )||2 

CRAY Y-MP (single processor) 


N a 

memory 

CPU 

FAC 

MFLOPS 

FAC 

# of itr. 

CPU /itr. 
PCG 

CPU 

PCG 

MFLOPS 

PCG 

CPU 

TOT 

4 

31 MW 

19.0 s. 

165 

Ill 

.77 s. 

85.5 s. 

130 

104.5 s. 

8 

26 MW 

13.6 s. 

150 

141 

.70s. 

98.7 s. 

110 

112.3 s. 

16 

20 MW 

9.6 s. 

130 

194 

.65 s. 

126.1 s. 

105 

135.7 s. 

32 

15 MW 

6.7 s. 

105 

381 

.61 s. 

232.4 s. 

90 

239.1 s. 

64 

12 MW 

5.0 s. 

85 

595 

.60 s. 

357.0 s. 

75 

362.8 s. 


Several observations are worthy discussing: 

• As expected, the FETI method with the dual preconditioner requires more 
memory than the FETI method with the trace preconditioner. On the other 
hand, the FETI method with the trace preconditioner offers substantial mem- 
ory savings over the global direct method. 

• The computational cost of both FETI algorithms is dominated by the solu- 
tion of the interface problem via the PCG method, especially for fine mesh 
partitions. When the number of subdomains is increased, the subdomain 
vectors become shorter and the MFLOP rate of the PCG algorithm on the 
CRAY Y-MP deteriorates. This MFLOP rate is slightly higher than that of 
the Jacobi-PCG algorithm for the undecomposed problem (TABLE 5) but is 
only half as high as that of the global direct method (TABLE 5). Therefore, 
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the toted CPU timings reported in TABLES 6-7 highlight the intrinsic com- 
putational superiority of the FETI methodology over both the direct and 
Jacobi-PCG globed algorithms. In the best case (4 subdomains), the FETI 
method with the trace preconditioner is three times faster than the globed 
direct solver and four times faster than the global Jacobi-PCG algorithm. 

• The dual preconditioner reduces the number of iterations performed by the 
trace preconditioner by a factor of 1.4 in the [4-16] subdomains range, and 
by a factor of 1.2 in the [32-64] subdomains range. However, a CG iteration 
with the dual preconditioner is 1.8 times more expensive than a CG iteration 
with the trace preconditioner, so that the trace preconditioner is overall more 
efficient. 

• The performance of both FETI algorithms deteriorates when the number 
of subdomain is increased (FIG. 9). We refer to this phenomenon as the 
numerical H non-scalability of the FETI methodology, Clearly, the super- 
convergence behavior discussed in Section 3 seems to disappear in the case 
of fine mesh partitions and the number of iterations seems to grow linearly 
with the interface size (FIG. 8). However, the reader should note that for 
the above structural dynamics problem, increasing the number of subdo- 
mains from 4 to 64 increases the number of iterations by a factor of 4.5, but 
increases the CPU time by a factor of 2.7 only (TABLE 6). This is because 
the cost of a PCG iteration decreases with the size of a subdomain. More- 
over, since the vector speed drops from 130 MFLOPS in the 4 subdomain 
case to 75 MFLOPS in the 64 subdomain case, increasing the number of 
subdomains from 4 to 64 actually increases the operation count of the FETI 
method by a factor of 2.7 x 75/130 = 1.6 only. 
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Space antenna connector 
67704 equations - Delta t = T3/15 = 0.0123 s. 



CPU (seconds) 


FETI( trace) 


o- FETl(duai) 


X 


direct 


Jacobi-PCG 


FIG. 9 Highlighting the H non-scalability 

9. Performance results 

The FETI methodology, the CG algorithm with diagonal scaling (Jacobi- 
PCG), and the LDL T direct method are implemented on both a CRAY Y-MP/8 
and an iPSC-860/128 parallel processors. Different optimal data structures are 
used for these different algorithms. All mesh partitions are obtained using the 
mesh decomposer described in [22]. 

On both parallel processors, the Jacobi-PCG algorithm is implemented in 
a subdomain-by-subdomain rather than element-by-element fashion in order to 
perform fewer floating-point operations. Within each subdomain, the transient 
tangent operator is assembled in sparse format. 

The implementation of the direct method on the CRAY Y-MP/8 multipro- 
cessor is described in Farhat [23]. The parallel skyline solver implemented on 
the iPSC-860/128 system is an improved version of the parallel LDL T algorithm 
presented in Farhat and Wilson [24]. In this improved version, a ring rather than 
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a tree communication algorithm [25] is used for broadcasting at each factorization 
step the pivotal column to all of the processors. The distributed data structure 
described in [24] is augmented with a pointer array that identifies the last active 
column of each structural equation. On the CRAY Y-MP/8 system, the forward 
and backward substitutions are serialized. On the iPSC-860/128 multiprocessor, 
only the backward substitution is serialized. These serializations are performed 
because of well-known mapping, synchronization, and communication bottlenecks 
in the parallel solution of skyline triangular systems [23, 24], 

The performances of the above parallel solvers are assessed with the nonlin- 
ear transient analysis of two structural systems: (a) the space antenna connector 
described in Section 8.1, and (b) a stiffened wing panel from the V22 tiltrotor 
aircraft (based on the panel described in Davis, Krishnamurthy, Stroud and Mc- 
Cleary [26]) (FIG. 10). The finite element model depicted in FIG. 10 contains 
9486 nodes, 9136 4-node shell elements with 6 d.o.f. per node, and a total of 
54216 d.o.f. 



FIG. 10 Finite element discretization of a stiffened wing panel 
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For both structural problems, the time step At is set to T 3 /15, where T 3 is 
the third fundamental period of the structure. This choice for At is reasonable for 
implicit schemes and leads to significantly larger time steps than those imposed 
by the Courant stability condition on explicit schemes (TABLE 8). 


TABLE 8 

Time step comparisons 


problem explicit implicit 

space antenna connector 5.70 10~ 7 s. 1.23 10 -2 s. 

stiffened wing panel 1.09 10“ 8 s. 1.79 10 -3 s. 


It is important to note that the condition number of the interface problem 

~t J=Ar ' A , . T 

F/ = Yj B j (M* + ^p-K‘ ) : B S strongly depends on the time step At. For 

3=1 


a sufficiently small time step, F j is “mass dominated” and therefore is rela- 
tively well-conditioned. For larger time steps such as those selected for the above 

two applications (TABLE 8), F 7 is “stiffness dominated” and is relatively ill- 
conditioned. 


9.1 Performance results on the CRAY Y-MP/8 system 

The performance results measured on the CRAY Y-MP/8 system for both 
structural problems are summarized in TABLES 9-10. The number of subdo- 
mains N 3 is set equal to the number of processors N p . Samples of the deformed 
configurations are shown in FIG. 11-12. 
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For both structural problems, the time step At is set to T 3 /15, where T 3 is 
the third fundamental period of the structure. This choice for At is reasonable for 
implicit schemes and leads to significantly larger time steps than those imposed 
by the Courant stability condition on explicit schemes (TABLE 8). 


TABLE 8 

Time step comparisons 


problem 


explicit 


space antenna connector 
stiffened wing panel 


implicit 


5.70 10“ 7 s. 1.23 10~ 2 s. 
1.09 10~ 8 s. 179 10~ 3 s. 


It is important to note that the condition/ number of the interface problem 

~t *=N. j j T / 

Fj = B 3 (M 3 — j-K* ) 1 B S stroi^ly depends on the time step At. For 

3=1 ^ / 

a sufficiently small time step, F 7 is /mass dominated” and therefore is rela- 
tively well-conditioned. For larger tifiie steps such as those selected for the above 

two applications (TABLE 8), F ^/is “stiffness dominated” and is relatively ill- 
conditioned. 


9.1 Performance results on the CRAY Y-MP/8 system 

The performance reshits measured on the CRAY Y-MP/8 system for both 
structural problems ar/ summarized in TABLES 9-10. The number of subdo- 
mains N 9 is set equa/to the number of processors N p . Samples of the deformed 
configurations are shown in FIG. 11-12. 
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TABLE 9 

Space antenna connector - performance results on the CRAY Y-MP/8 system 


67704 equations - At = y| = 0.01235. 


FETI preconditioner: 



*E*B s (M a + ^K t ‘)B 3T 


Convergence criterion: 


|K (iff 1 


±i 


Aul *+ l ) - r (<*> 1 )|| 2 


+ 1 

l | r ( u ^ 1 )|| 2 


< 10~ 4 


N p 

N, 

of itr. 
FETI 

ff of itr. 
Jacobi-PCG 

CPU 

FETI 

CPU 

Jacobi-PCG 

CPU 

direct 

— 

1 

1 

— 

3320 

— 

345.6 s. 

253.0 s. 


4 

4 

159 

3320 

24.3 s. 

93.9 s. 

70.3 s. 

— 

8 

8 

200 

3320 

13.7 s. 

50.2 s. 

39.5 s. 



TABLE 10 

Stiffened wing panel - performance results on the CRAY Y-MP/8 system 
54216 equations - At = = 0.001795. 

— f 1 s=N, , _ T 

FETI preconditioner: T/ = B S (M S + -^-K* jB* 

3 = 1 


Convergence criterion: 


IlK (<> ) 


AU^ 1 ' 

ll r (uj*j l )ll 




< 10“ 4 


N p 

n 3 

of itr. 

FETI 

of itr. 

Jacobi-PCG 

CPU 

FETI 

CPU 

Jacobi-PCG 

CPU 

direct 

1 

1 

— 

4775 

— 

418.6 s. 

259.7 s. 

4 

4 

300 

4775 

36.5 s. 

114.0 s. 

72.1 s. 

8 

8 

396 

4775 

20.9 s. 

60.8 s. 

40.6 s. 
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FIG. 11 Sample of the transient response of the space antenna connector 



FIG. 12 Sample of the transient response of the stiffened wing panel 
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All algorithms axe reported to achieve good speed-ups. This is essentially because 
the CRAY Y-MP/8 system has a shared memory and a relatively small number 
of processors. For the antenna connector problem, the FETI method is shown to 
outperform the direct solver by a factor of three and the Jacobi-PCG algorithm 
by a factor of four. The wing panel structure apparently generates a stiffer 
problem: the FETI method converges with a larger number of iterations than for 
the connector problem, but still outperforms the direct solver by a factor of two 
and the Jacobi-PCG algorithm by a factor of three. 

9.2. Performance results on the iPSC-860/128 multiprocessor 

The performance results measured on an iPSC-860/128 Touchstone Gamma mul- 
tiprocessor are summarized in TABLE 11 for the antenna connector problem and 
in TABLE 12 for the wing panel problem. A minimum of 32 and 64 processors are 
needed to accommodate the memory requirements of the FETI method and the 
direct solver, respectively. Our current implementation of the FETI algorithm on 
on this parallel processor allows only one processor per subdomain. Therefore, 
it is important to note that the FETI methodology is benchmarked here in the 
worst conditions, given that its intrinsic performance deteriorates in the presence 
of large numbers of subdomains (Section 8.3). 


TABLE 11 

Space antenna connector - performance results on the iPSC-860/128 system 
67704 equations - At = yg = 0.0123s. 

FETI preconditioner: Tj = B a (M a + ^■K t *)B' ,T 

5=1 

Convergence criterion: '- 1 {Un + l> V 1 — ■ "+ l)N2 < 10“ 4 

l|r<<0|| s 


N p 

Ns 

# of itr. 
FETI 

if of itr. 
Jacobi-PCG 

CPU 

FETI 

CPU 

Jacobi-PCG 

CPU 

direct 

32 

32 

456 


144.0 s. 

229.4 s. 



64 

64 

717 

3320 

144.1 s. 

144.8 s. 

943.3 s. 
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TABLE 12 


Stiffened wing panel - performance results on the iPSC-860/128 system 
54216 equations - At = y| = 0.00179s. 

~t~ i s=N, 

FETI preconditioner: Tj = B a (M 3 + )B S 

3=1 

Convergence criterion: ^ ^ Un +vl AU ^t 1 ~ r ( U "+ihh < ^q-4 


N p 

n 9 

of itr. 

ff of itr. 

CPU 

CPU 

CPU 



FETI 

Jacobi-PCG 

FETI 

Jacobi-PCG 

direct 

64 

64 

705 

4775 

145.0 s. 

227.0 s. 

776.2 s. 

128 

128 

1128 

4775 

136.4 s. 

160.9 s. 

845.4 s. 


For the space antenna connector problem, the FETI method is 1.6 times faster 
than the Jacobi-PCG algorithm for 32 processors and 32 subdomains, and 6.5 
times faster than the direct solver for 64 processors. For the stiffened wing panel 
problem, the FETI method is also 1.6 times faster than the Jacobi-PCG algorithm 
for 64 processors and 64 subdomains, and 5.4 and 6.3 times faster than the direct 
solver for 64 and 128 processors, respectively. However, for 128 processors and 
subdomains in the second problem, the FETI method is only 18% faster than the 
Jacobi-PCG scheme. This is partly because the superconvergence phenomenon 
disappears in the presence of fine mesh partitions, and partly because when the 
mesh size h is kept constant and the number of subdomains is increased, the 
local problems become smaller and the interface problem becomes larger, and 
the FETI method with a trace preconditioner converges towards a Jacobi-PCG 
scheme. 

The parallel direct solver exhibits poor performance results for both struc- 
tural problems. This is essentially because the bandwidth of the system of equa- 
tions associated with each problem is out-of-balance with the number of proces- 
sors needed to store K in a distributed skyline format. We remind the reader 
that the parallelism in direct skyline solvers scales with the bandwidth of the 
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problem and not with its size. Table 13 reports the repartition of the CPU tim- 
ings between compute, send and receive, for both the FETI method and parallel 
direct solver. 


TABLE 13 


Space antenna connector - compute v.s. send v.s. receive on the iPSC-860/128 system 


algorithm 

N p 

compute 

send 

receive 

dot product 

total CPU time 

FETI (T 7 ) 

64 

91.1 s. 

1.0 s. 

9.0 s. 

43.0 s. 

144.1 s. 

direct (factor) 

64 

54.4 s. 

50.7 s. 

623.0 s. 

— 

728.1 s. 

direct (forward) 

64 

1.4 s. 

0.1 s. 

20.0 s. 

— 

21.5 s. 

direct (backward) 

64 

0.4 s. 

1.3 s. 

192.0 s. 

— 

193.7 s. 

direct (total) 

64 

56.2 s. 

52.1 s. 

835.0 s. 


943.3 s. 


Clearly, the elapsed CPU time reported for the parallel direct solver is essentially 
spent in synchronization and interprocessor communication. On the other hand, 
the FETI method is shown to be communication efficient. The dot products 
performed during the solution of the interface problem are timed separately be- 
cause they are implemented via calls to the iNTEL system routine gdsum. We 
could not evaluate the repartition of these timings between compute, send and 
receive. If we can assume that they are equally distributed between computation 
and communication, then we can conclude that only 21.8% of the total CPU 
time reported for the FETI method is spent in synchronization interprocessor 
communication. 

The performance results summarized in TABLES 9-13 also demonstrate that 
the iPSC-860/128 Touchstone Gamma multiprocessor cannot compete with the 
CRAY Y-MP/8 system as far as direct solvers are concerned. However, for 
explicit-like computations such as those characterizing the FETI and Jacobi-PCG 
algorithms, 32 processors of the iPSC-860/128 system are shown to outperform 
one CRAY Y-MP processor. 

10. Circumventing the numerical H non-scalability 

The H non-scalability property characterizing the FETI methodology and 
discussed in Section 8 requires keeping the number of subdomains relatively sm all 
in order to obtain the best possible performance. However, increasing the num- 
ber of subdomains while keeping the mesh size constant seems a priori attractive 
because: (a) it reduces the cost of the local problems, and (b) it also reduces the 
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cost per iteration of the interface problem. Nevertheless, the results reported in 
TABLE 6 clearly indicate that refining beyond a certain number of subdomains 
the mesh partition does not significantly reduce any further the computational 
costs of neither the subdomain nor the interface problems. This can be explained 
by the fact that when the number of subdomains is increased, the subdomain 
bandwidth which governs the computational costs of both the subdomain and 
interface problems does not decrease as fast as the subdomain size. As an exam- 
ple, consider the N x N x N uniform discretization of a cubic region. A uniform 
mesh partitioning of this volume produces N s subdomains each containing N 3 /N a 
nodes and each having a local bandwidth equal to N 2 /N a ^ 3 (assuming 1 d.o.f. per 
node). If the number of subdomains N s is increased, the size of each local problem 
decreases as N s 1 while its local bandwidth decreases as N a 2 ^ 3 . Therefore, even 
strictly from a computational complexity viewpoint and independently from rate 
of convergence considerations, it is not necessarily advantageous to introduce a 
large number of subdomains. 

With the advent of massively parallel processors, the above discussion im- 
plies that one should allocate more than one processor to a given subdomain. 
The benefits of this strategy are illustrated in TABLE 14 for the space antenna 
connector problem. For 4 subdomains and 8 processors, the elapsed CPU time 
for the FETI method on the CRAY Y-MP/8 multiprocessor is 11.2 seconds, down 
from 13.7 seconds for 8 subdomains and 8 processors. Of course, significantly bet- 
ter performance improvements can be expected for the range [32-128] processors 
on the iPSC-860/128 system. 


TABLE 14 

Space antenna connector - circumventing the H non-scalability of FETI 
67704 equations - At — = 0.0123s. 

Preconditioner: T j = B S (M 3 + ^^K < *)B 3T 


Convergence criterion 


IK (uj,*jt) Au^-rtu^jjiu 


l|r(<5i)l|» 

CRAY Y-MP/8 


< 10 


-4 


N p N s # of itr. CPU 

1 4 159 87.4 s. 

4 4 159 24.3 s. 

8 4 159 11.2 s. 
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11. Closure 

A memory efficient domain decomposition (DD) method for implicit tran- 
sient dynamics has been presented. It is based on the concept of Finite Element 
Tearing and Interconnecting (FETI) developed by Farhat and Roux [8], Two 
subdomain-by-subdomain preconditioners with direct mechanical interpretations 
have been formulated. The first preconditioner, called here a trace precondi- 
tioner, is constructed as the sum of the projections on the subdomain interfaces 
of the inverses of the subdomain transient operators. The second preconditioner, 
called here a dual preconditioner, is constructed as the sum of the exact inverses 
of the subdomain transient operators. When preconditioned with the trace oper- 
ator, the interface problem behaves as if its condition number is independent of 
the mesh size. When preconditioned with the dual preconditioner, the interface 
problem has a condition number that is independent of the mesh size. Therefore, 
from a mathematical viewpoint, the dual preconditioner is optimal. However, the 
interface preconditioner is cheaper and computationally more efficient. For coarse 
mesh partitions, we have shown that the proposed FETI method can outperform 
on the CRAY Y-MP/8 multiprocessor the parallel direct solver and the parallel 
conjugate gradient algorithm with diagonal scaling by factors of 3 and 4, respec- 
tively. For fine mesh partitions, say more than 32 subdomains, the performance 
of the FETI algorithm degrades — and this is the case for most DD methods. 
Therefore, when working with a massively parallel processor, we recommend to 
keep the number of subdomains as small as possible and allocate more than one 
processor per subdomain. However, even in the worst case of one processor per 
subdomain, the FETI method still outperforms the parallel direct solver on the 
iPSC-860/128 system because of its low communication costs. Finally, the results 
we have reported for two realistic structural dynamics problems indicate that 32 
processors of an iPSC-860 Gamma system can outperform a CRAY Y-MP pro- 
cessor. 
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